Wednesday, December 15, 2021

Exploring large datasets locally with Vaex

I recently came across an interesting Python library that is worth investing some time in. The project is promising: it addresses many of the problems every data analyst/scientist has encountered while exploring large datasets on a local machine with Pandas.

In this relatively short blog post we'll go through Vaex, and we'll also see how similar the Pandas and Vaex APIs are.



 Here's how it's going to be laid out:
  1. Why Vaex
  2. Vaex components
  3. Exploring data with Vaex
  4. Visualization with Vaex
  5. Machine learning with Vaex
  6. POC @ our company

 1. Why Vaex


You all start your exploration journey with Pandas: check data types, missing values, summary statistics, make some plots, and so on.

Now, you might have experienced memory issues with relatively large datasets, right? In fact, Wes McKinney, the creator of Pandas, wrote an interesting article about its downsides. Basically, what he points out is Pandas' excessive usage of RAM as well as its inability to use multiple cores.

Vaex responds to these issues with mainly three things:
  • Memory mapping.
  • Lazy evaluation with expressions.
  • Multiprocessing.

Thus Vaex lets you visualize and explore big tabular datasets. To quote the documentation: "It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion objects/rows per second".

Memory mapping helps avoid copying the whole file into memory: the file stays on disk, its contents are mapped into the process's address space, and only the parts you actually touch get read.
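You can see the same principle with NumPy's memmap, which is what vaex-hdf5 builds on: the array lives on disk, and opening it is instant no matter how large it is.

```python
import os
import tempfile
import numpy as np

# Create a file-backed array; writes go through to disk.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
mm = np.memmap(path, dtype="float64", mode="w+", shape=(1_000_000,))
mm[:] = np.arange(1_000_000)
mm.flush()

# Re-opening maps the file without reading it all into RAM;
# only the pages you access are actually loaded.
ro = np.memmap(path, dtype="float64", mode="r", shape=(1_000_000,))
print(ro[123_456])  # 123456.0 -- only the needed page is read
```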

 2. Vaex components

Vaex has different packages to enable full exploratory analysis capabilities.

Vaex-core: the DataFrame and core algorithms; takes numpy arrays as input columns.
Vaex-hdf5: provides memory-mapped numpy arrays to a DataFrame.
Vaex-jupyter: interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.
Vaex-ml: machine learning with a scikit-learn-like API.


You can refer to the documentation for more details.


3. Exploring data with Vaex


In this section we will use the Vaex API to explore the NYC taxi dataset. I will provide a link to download the data and a second link to a Jupyter notebook to follow along.

We'll start slow by computing some elementary statistics and plots. Then we'll get to the part where Pandas wastes a lot of your RAM: filters, selections and subsets.

You'll notice along the way that one amazing thing about Vaex, aside from its speed, is that its API is similar to Pandas!



Let's start by reading the dataset. I recommend using an HDF5 or Arrow file to support memory mapping for faster computations; if you don't have an HDF5 file, don't worry, you can convert your CSV.
 

import vaex
df = vaex.open('taxi/yellow_tripdata_2015-07.csv', convert=True)


Filtering is as easy as it is in Pandas.



Now, before moving on to visualizations, I would like to emphasize expressions: almost everything in Vaex is an expression.



Expressions are Python strings that only get computed when their result is asked for. Filters, likewise, result in a reference to the existing data plus a boolean mask that keeps track of which rows are selected, instead of a copy.

4. Visualization with Vaex


Visualization in Vaex lets you plot millions of rows in a single plot, fast. Personally, I'm fond of plot_widget with the ipyleaflet backend; it is really useful with geo-location data.


5. Machine learning with Vaex

There's an extension to Vaex for machine learning. It's highly similar to scikit-learn, but so far it doesn't support many learning algorithms.

6. POC @ our company


A few weeks ago, I made a small POC to present Vaex to my coworkers. They were very excited about it and had many questions about the internals of Vaex. But the most important part is that we started brainstorming together on how we could use Vaex to facilitate tasks specific to data engineering.

The Spark folks wanted to use Vaex for data transformation since it is really fast, and the machine learning experts wanted to know if the Vaex API is similar to scikit-learn's. It really enriched the conversation.

Back to the POC: it was built on top of our PaaS data lake, which is entirely on AWS. My presentation was about using our platform to ingest data, start exploring it with Vaex, and eventually build a machine learning model.

The stack I used was a native Glue crawler, which ingests data from our platform to S3 and creates tables for it in the Glue Data Catalog; then I used SageMaker with a custom virtualenv.
